To complete this milestone, edit this
.Rmd file directly. Fill in the sections that are
commented out with <!--- start your work here--->.
When you are done, make sure to knit to an .md file by
changing the output in the YAML header to github_document,
before submitting a tagged release on canvas.
In Milestone 1, you explored your data and came up with research questions. This time, we will finish up our mini data analysis and obtain results for your data by summarizing and graphing it, fitting a model, and reading and writing files.
We will also explore more in depth the concept of tidy data.
NOTE: The main purpose of the mini data analysis is to integrate what you learn in class in an analysis. Although each milestone provides a framework for you to conduct your analysis, it’s possible that you might find the instructions too rigid for your data set. If this is the case, you may deviate from the instructions – just make sure you’re demonstrating a wide range of tools and techniques taught in this class.
To complete this milestone, edit this
very .Rmd file directly. Fill in the sections that are
tagged with <!--- start your work here--->.
To submit this milestone, make sure to knit this
.Rmd file to an .md file by changing the YAML
output settings from output: html_document to
output: github_document. Commit and push all of your work
to your mini-analysis GitHub repository, and tag a release on GitHub.
Then, submit a link to your tagged release on canvas.
Points: This milestone is worth 50 points: 45 for your analysis, and 5 for overall reproducibility, cleanliness, and coherence of the GitHub submission.
Research Questions: In Milestone 1, you chose four research questions to focus on. Wherever realistic, your work in this milestone should relate to these research questions whenever we ask for justification behind your work. In the case that some tasks in this milestone don’t align well with one of your research questions, feel free to discuss your results in the context of a different research question.
By the end of this milestone, you should, among other things, be able to reshape data with tidyr.
Begin by loading your data and the tidyverse package below:
library(datateachr) # <- might contain the data you picked!
library(tidyverse)
library(reactable)
library(ggplot2)
From milestone 1, you should have an idea of the basic structure of your dataset (e.g. number of rows and columns, class types, etc.). Here, we will start investigating your data more in-depth using various data manipulation functions.
First, write out the 4 research questions you defined in milestone 1. This will guide your work through milestone 2:
For each genus, what common names of trees are found in Vancouver?
Is there a relationship between the diameter of trees and the age of trees? Analyze for the most common tree type
What is the age vs height relationship of the most common tree?
Is there a relationship between the height of trees and the diameter of trees?
Here, we will investigate your data using various data manipulation and graphing functions.
Now, for each of your four research questions, choose one task from options 1-4 (summarizing), and one other task from 4-8 (graphing). You should have 2 tasks done for each research question (8 total). Make sure it makes sense to do them! (e.g. don’t use a numerical variable for a task that needs a categorical variable.) Comment on why each task helps (or doesn’t!) answer the corresponding research question.
Ensure that the output of each operation is printed!
Also make sure that you’re using dplyr and ggplot2 rather than base R. Outside of this project, you may find that you prefer using base R functions for certain tasks, and that’s just fine! But part of this project is for you to practice the tools we learned in class, which are dplyr and ggplot2.
Summarizing: (do not use the function table()!)
Graphing:
Using variables and/or tables you made in one of the “Summarizing” tasks:
Make sure it’s clear what research question you are doing each operation for!
For each genus, what common names of trees are found in Vancouver?
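Summarizing: Computing the number of distinct genus names and distinct common names, then the number of observations (trees) for each genus and for each genus/common-name pair. This shows how many genus names and common names of trees are in our data and how the trees are spread across them.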
van_trees_fct <- vancouver_trees %>%
mutate(genus_name = factor(genus_name),
common_name = factor(common_name))
n_genus_name <- nlevels(van_trees_fct$genus_name)
n_common_name <- nlevels(van_trees_fct$common_name)
head(van_trees_fct)
## # A tibble: 6 × 20
## tree_id civic_number std_street genus_name species_name cultivar_name
## <dbl> <dbl> <chr> <fct> <chr> <chr>
## 1 149556 494 W 58TH AV ULMUS AMERICANA BRANDON
## 2 149563 450 W 58TH AV ZELKOVA SERRATA <NA>
## 3 149579 4994 WINDSOR ST STYRAX JAPONICA <NA>
## 4 149590 858 E 39TH AV FRAXINUS AMERICANA AUTUMN APPLAUSE
## 5 149604 5032 WINDSOR ST ACER CAMPESTRE <NA>
## 6 149616 585 W 61ST AV PYRUS CALLERYANA CHANTICLEER
## # ℹ 14 more variables: common_name <fct>, assigned <chr>, root_barrier <chr>,
## # plant_area <chr>, on_street_block <dbl>, on_street <chr>,
## # neighbourhood_name <chr>, street_side_name <chr>, height_range_id <dbl>,
## # diameter <dbl>, curb <chr>, date_planted <date>, longitude <dbl>,
## # latitude <dbl>
cat(paste("Number of distinct genus_name values:", n_genus_name))
## Number of distinct genus_name values: 97
cat(paste("Number of distinct common_name values:", n_common_name))
## Number of distinct common_name values: 634
genus <- van_trees_fct %>%
group_by(genus_name) %>%
summarise(frequency = n())
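genus_freq <- genus %>%
# genus has one row per genus, so order the levels by the count column
# (fct_infreq() would see every level exactly once and leave the order unchanged)
mutate(genus_name = fct_reorder(genus_name, frequency, .desc = TRUE))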
reactable(genus_freq)
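When the goal is to order a one-row-per-genus summary table by its counts, forcats::fct_reorder() is one option: fct_infreq() ranks levels by how often they appear in the rows, which is once each in a summary table. A minimal sketch on a hypothetical toy table (toy_genus and its counts are made up for illustration and are not part of the analysis):

```r
library(dplyr)
library(tibble)
library(forcats)

# Hypothetical toy summary table: one row per genus with its tree count
toy_genus <- tibble(
  genus_name = factor(c("ACER", "PRUNUS", "ULMUS")),
  frequency  = c(30L, 50L, 10L)
)

# Reorder the factor levels by the frequency column, largest count first
toy_ordered <- toy_genus %>%
  mutate(genus_name = fct_reorder(genus_name, frequency, .desc = TRUE))

# Inspect the new level order; ggplot2 uses this order on a discrete axis
levels(toy_ordered$genus_name)
```

With the levels ordered this way, a bar chart of the table would show its bars from the most to the least common genus.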
genus_common_freq <- van_trees_fct %>%
group_by(genus_name, common_name) %>%
summarise(frequency = n())
## `summarise()` has grouped output by 'genus_name'. You can override using the
## `.groups` argument.
reactable(genus_common_freq)
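The research question asks which common names occur within each genus, so a table with one row per genus can be easier to read than one row per genus/common-name pair. The sketch below shows one possible way to build it with dplyr; toy_trees is a hypothetical stand-in for the real data, and the column names n_common_names and common_names are assumptions made for this example.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: one row per tree (values are illustrative only)
toy_trees <- tribble(
  ~genus_name, ~common_name,
  "PRUNUS",    "KWANZAN FLOWERING CHERRY",
  "PRUNUS",    "KWANZAN FLOWERING CHERRY",
  "PRUNUS",    "PISSARD PLUM",
  "ACER",      "NORWAY MAPLE"
)

# One row per genus: how many distinct common names it has, and which ones
names_by_genus <- toy_trees %>%
  distinct(genus_name, common_name) %>%
  group_by(genus_name) %>%
  summarise(
    n_common_names = n(),
    common_names   = paste(sort(common_name), collapse = "; "),
    .groups = "drop"
  )

print(names_by_genus)
```

Passing .groups = "drop" returns an ungrouped tibble, which also avoids the grouping message that summarise() prints otherwise.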
Graphing: This helps to see at a glance which genera are most common in Vancouver
genus_barplot <- genus_freq %>%
rename(count = frequency) %>%
filter(count > 500) %>%
ggplot(aes(genus_name, count)) +
geom_col() +
theme_minimal() +
xlab("Genus Name") +
ylab("Tree Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Tree Counts for Genus Names with Counts > 500")
print(genus_barplot)
Is there a relationship between the diameter of trees and the age of trees? Analyze for the most common tree type
Summarizing: Here, we are pulling out the most common tree in Vancouver to analyze age vs diameter data
today <- Sys.Date() # today's date; ages below depend on the day this is run
vancouver_trees_age <- vancouver_trees %>%
mutate(age_yrs = (as.numeric(difftime(today, date_planted, units = "days"))) / 365.25) %>%
#selects only specific columns
select(genus_name,common_name,height_range_id,diameter,date_planted,age_yrs)
reactable(vancouver_trees_age)
#Pulling out the common tree name that has the most number of trees in Vancouver
max_index <- which.max(genus_common_freq$frequency)
most_common_tree_genus <- genus_common_freq$genus_name[max_index]
most_common_tree <- genus_common_freq$common_name[max_index]
most_common_tree_row <- genus_common_freq[max_index, ]
print(most_common_tree_row)
## # A tibble: 1 × 3
## # Groups: genus_name [1]
## genus_name common_name frequency
## <fct> <fct> <int>
## 1 PRUNUS KWANZAN FLOWERING CHERRY 10529
most_common_tree_all <- vancouver_trees_age %>%
filter(common_name == most_common_tree)
reactable(most_common_tree_all)
Graphing: Here, we will plot age vs diameter of Vancouver’s most common tree to see the correlation
diam_age_plot <- most_common_tree_all %>%
ggplot(aes(age_yrs, diameter)) +
geom_point(alpha = 0.3) +
theme_classic() +
xlab("Age (yrs)") +
ylab("Diameter (in)") +
coord_cartesian(ylim = c(0,40))
print(diam_age_plot)
## Warning: Removed 8980 rows containing missing values (`geom_point()`).
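A scatterplot shows the shape of the relationship, and a single correlation coefficient can put a number on it. The following is a minimal sketch under the assumption that a Pearson correlation is a reasonable summary; toy_age_diam is hypothetical data, with one missing age to mirror the rows that the plot drops.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; the NA mimics a tree with no planting date
toy_age_diam <- tibble(
  age_yrs  = c(5, 10, 15, 20, NA),
  diameter = c(2, 4, 6, 8, 3)
)

# Drop incomplete rows explicitly, then compute the Pearson correlation
age_diam_cor <- toy_age_diam %>%
  filter(!is.na(age_yrs), !is.na(diameter)) %>%
  summarise(n = n(), r = cor(age_yrs, diameter))

print(age_diam_cor)
```

Filtering before summarising makes the number of trees behind the coefficient explicit, instead of leaving it to a warning from the plotting layer.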
What is the age vs height relationship of the most common tree?
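Summarizing: Finding the most frequent height_range_id (the mode) for the most common tree, after dropping trees with no recorded age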
mc_tree_height <- most_common_tree_all %>%
select(genus_name,common_name,height_range_id,diameter,age_yrs) %>%
filter(!is.na(age_yrs))
print(mc_tree_height)
## # A tibble: 1,549 × 5
## genus_name common_name height_range_id diameter age_yrs
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 PRUNUS KWANZAN FLOWERING CHERRY 2 6.5 26.9
## 2 PRUNUS KWANZAN FLOWERING CHERRY 2 3 28.5
## 3 PRUNUS KWANZAN FLOWERING CHERRY 1 3 11.6
## 4 PRUNUS KWANZAN FLOWERING CHERRY 1 4.75 11.0
## 5 PRUNUS KWANZAN FLOWERING CHERRY 1 7.75 10.0
## 6 PRUNUS KWANZAN FLOWERING CHERRY 2 6 26.9
## 7 PRUNUS KWANZAN FLOWERING CHERRY 2 8.5 24.5
## 8 PRUNUS KWANZAN FLOWERING CHERRY 2 5.5 26.9
## 9 PRUNUS KWANZAN FLOWERING CHERRY 1 5.25 17.8
## 10 PRUNUS KWANZAN FLOWERING CHERRY 1 5 16.8
## # ℹ 1,539 more rows
height_mode <- mc_tree_height %>%
group_by(height_range_id) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
filter(row_number() == 1) %>%
pull(height_range_id)
cat("The mode of the height range id for the most common tree in Vancouver is:", height_mode)
## The mode of the height range id for the most common tree in Vancouver is: 1
Graphing: Using a jitterplot to understand the age of trees for each height ID
heightID_age_jitterplot <- mc_tree_height %>%
ggplot(aes(height_range_id, age_yrs)) +
geom_jitter(width = 0.1, alpha = 0.5)
print(heightID_age_jitterplot)
Is there a relationship between the height of trees and the diameter of trees?
Summarizing: Find mean diameter of the most common tree in Vancouver
mc_tree_height_diam <- mc_tree_height
mean_diam <- mc_tree_height_diam %>%
summarise(mean = mean(diameter))
print("The mean diameter for the most common tree in Vancouver is: ")
## [1] "The mean diameter for the most common tree in Vancouver is: "
print(mean_diam)
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 6.96
Graphing: Using a jitterplot to understand the diameter of trees for each height ID
heightID_diam_jitterplot <- mc_tree_height_diam %>%
ggplot(aes(height_range_id, diameter)) +
geom_jitter(width = 0.1, alpha = 0.5)
print(heightID_diam_jitterplot)
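Since this question compares diameter across height classes, a summary computed per height_range_id may say more than a single overall mean. A small sketch of that idea, using a hypothetical toy table (toy_height_diam) rather than the real data:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: height class and diameter for five trees
toy_height_diam <- tibble(
  height_range_id = c(1, 1, 2, 2, 3),
  diameter        = c(3, 5, 6, 8, 12)
)

# Count, mean, and median diameter within each height class
diam_by_height <- toy_height_diam %>%
  group_by(height_range_id) %>%
  summarise(
    n               = n(),
    mean_diameter   = mean(diameter),
    median_diameter = median(diameter),
    .groups = "drop"
  )

print(diam_by_height)
```

If the per-class means rise with height_range_id, that would be consistent with taller trees having larger diameters; mapping factor(height_range_id) to the x axis of a geom_boxplot() would show the same comparison graphically.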
Based on the operations that you’ve completed, how much closer are you to answering your research questions? Think about what aspects of your research questions remain unclear. Can your research questions be refined, now that you’ve investigated your data a bit more? Which research questions are yielding interesting results?
Much closer! It is really interesting to see the two jitter plots relating height ID vs age and height ID vs diameter!
In this task, we will do several exercises to reshape our data. The
goal here is to understand how to do this reshaping with the
tidyr package.
A reminder of the definition of tidy data: each row is an observation, each column is a variable, and each cell is a value.
Based on the definition above, can you identify if your data is tidy or untidy? Go through all your columns, or if you have >8 variables, just pick 8, and explain whether the data is untidy or tidy.
Yes, it is tidy: each row is a single tree and each column is a single variable. For most of the analysis above, I selected only the columns I am interested in and removed rows with missing age values.
Now, if your data is tidy, untidy it! Then, tidy it back to its original state.
If your data is untidy, then tidy it! Then, untidy it back to its original state.
Be sure to explain your reasoning for this task. Show us the “before” and “after”.
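One possible way to show a "before" and "after" is to spread one variable across columns with tidyr::pivot_wider() and then gather it back with tidyr::pivot_longer(). The sketch below assumes a small count table; tidy_counts and its values are hypothetical and only illustrate the round trip.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical tidy table: one row per (common_name, height_range_id) pair
tidy_counts <- tribble(
  ~common_name,               ~height_range_id, ~n,
  "KWANZAN FLOWERING CHERRY", 1,                5,
  "KWANZAN FLOWERING CHERRY", 2,                3,
  "NORWAY MAPLE",             1,                2,
  "NORWAY MAPLE",             2,                4
)

# "After" (untidy): height_range_id values become column names
untidy_counts <- tidy_counts %>%
  pivot_wider(
    names_from   = height_range_id,
    values_from  = n,
    names_prefix = "height_"
  )

# Back to the original: gather the height_* columns into rows again
retidied_counts <- untidy_counts %>%
  pivot_longer(
    cols         = starts_with("height_"),
    names_to     = "height_range_id",
    names_prefix = "height_",
    values_to    = "n"
  ) %>%
  mutate(height_range_id = as.numeric(height_range_id))

print(untidy_counts)
print(retidied_counts)
```

The wide version is untidy because the height class, a variable, is stored in the column names. pivot_longer() returns the names as text, which is why the last step converts height_range_id back to numeric.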
Now, you should be more familiar with your data, and also have made progress in answering your research questions. Based on your interest, and your analyses, pick 2 of the 4 research questions to continue your analysis in the remaining tasks:
Is there a relationship between the diameter of trees and the age of trees? Analyze for the most common tree type
What is the age vs height relationship of the most common tree?
Explain your decision for choosing the above two research questions.
The plots looked really interesting, and I am always interested in understanding how tree age affects height and diameter.
Now, try to choose a version of your data that you think will be appropriate to answer these 2 questions. Use between 4 and 8 functions that we’ve covered so far (i.e. by filtering, cleaning, tidy’ing, dropping irrelevant columns, etc.).
(If it makes more sense, then you can make/pick two versions of your data, one for each research question.)
Cleanest data for the diameter vs. age question (see above for how this was created):
mc_tree_diam_age <- mc_tree_height
print(mc_tree_diam_age)
## # A tibble: 1,549 × 5
## genus_name common_name height_range_id diameter age_yrs
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 PRUNUS KWANZAN FLOWERING CHERRY 2 6.5 26.9
## 2 PRUNUS KWANZAN FLOWERING CHERRY 2 3 28.5
## 3 PRUNUS KWANZAN FLOWERING CHERRY 1 3 11.6
## 4 PRUNUS KWANZAN FLOWERING CHERRY 1 4.75 11.0
## 5 PRUNUS KWANZAN FLOWERING CHERRY 1 7.75 10.0
## 6 PRUNUS KWANZAN FLOWERING CHERRY 2 6 26.9
## 7 PRUNUS KWANZAN FLOWERING CHERRY 2 8.5 24.5
## 8 PRUNUS KWANZAN FLOWERING CHERRY 2 5.5 26.9
## 9 PRUNUS KWANZAN FLOWERING CHERRY 1 5.25 17.8
## 10 PRUNUS KWANZAN FLOWERING CHERRY 1 5 16.8
## # ℹ 1,539 more rows
Cleanest data for the age vs. height question (see above for how this was created):
print(mc_tree_height)
## # A tibble: 1,549 × 5
## genus_name common_name height_range_id diameter age_yrs
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 PRUNUS KWANZAN FLOWERING CHERRY 2 6.5 26.9
## 2 PRUNUS KWANZAN FLOWERING CHERRY 2 3 28.5
## 3 PRUNUS KWANZAN FLOWERING CHERRY 1 3 11.6
## 4 PRUNUS KWANZAN FLOWERING CHERRY 1 4.75 11.0
## 5 PRUNUS KWANZAN FLOWERING CHERRY 1 7.75 10.0
## 6 PRUNUS KWANZAN FLOWERING CHERRY 2 6 26.9
## 7 PRUNUS KWANZAN FLOWERING CHERRY 2 8.5 24.5
## 8 PRUNUS KWANZAN FLOWERING CHERRY 2 5.5 26.9
## 9 PRUNUS KWANZAN FLOWERING CHERRY 1 5.25 17.8
## 10 PRUNUS KWANZAN FLOWERING CHERRY 1 5 16.8
## # ℹ 1,539 more rows
Pick a research question from 1.2, and pick a variable of interest (we’ll call it “Y”) that’s relevant to the research question. Indicate these.
Research Question: Is there a relationship between the diameter of trees and the age of trees? Analyze for the most common tree type
Variable of interest: diameter of trees
Fit a model or run a hypothesis test that provides insight on this variable with respect to the research question. Store the model object as a variable, and print its output to screen. We’ll omit having to justify your choice, because we don’t expect you to know about model specifics in STAT 545.
Note: It’s OK if you don’t know how these models/tests work. Here are some examples of things you can do here, but the sky’s the limit.
Fit a model using the lm() function.
Test a hypothesis about a mean using t.test(), or maybe whether the means across two groups are different
using t.test(), or maybe whether the means across multiple groups
are different using anova() (you may have to pivot your
data for the latter two).
Use lm() to test for significance of regression coefficients.
model <- lm(diameter ~ age_yrs, data = mc_tree_diam_age)
summary(model)
##
## Call:
## lm(formula = diameter ~ age_yrs, data = mc_tree_diam_age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.267 -1.963 -0.329 1.300 58.502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.70544 0.22177 -3.181 0.0015 **
## age_yrs 0.41139 0.01055 39.013 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.048 on 1547 degrees of freedom
## Multiple R-squared: 0.4959, Adjusted R-squared: 0.4956
## F-statistic: 1522 on 1 and 1547 DF, p-value: < 2.2e-16
Produce something relevant from your fitted model: either predictions on Y, or a single value like a regression coefficient or a p-value.
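As an illustration of producing a single value from a fitted model, the sketch below uses broom::tidy() to turn an lm fit into a tibble and then keeps the slope row. The data (toy_growth) and the model object (toy_model) are hypothetical; the same calls should apply to any lm object, such as one fitted on diameter and age_yrs.

```r
library(broom)
library(dplyr)
library(tibble)

# Hypothetical toy data with a roughly linear age/diameter relationship
toy_growth <- tibble(
  age_yrs  = c(5, 10, 15, 20, 25, 30),
  diameter = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.0)
)

toy_model <- lm(diameter ~ age_yrs, data = toy_growth)

# tidy() returns one row per coefficient with columns
# term, estimate, std.error, statistic, and p.value
coef_table <- tidy(toy_model)

# Keep only the slope for age_yrs; "estimate" is the column of interest
age_slope <- coef_table %>%
  filter(term == "age_yrs") %>%
  select(term, estimate, p.value)

print(age_slope)
```

The estimate column holds the regression coefficient, read here as the change in diameter associated with one additional year of age.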
Obtain your results using the broom package if
possible. If your model is not compatible with the broom function you
need, then you can obtain your results by some other means, but first
indicate which broom function is not compatible.
Get set up for this exercise by making a folder called
output in the top level of your project folder /
repository. You’ll be saving things there.
Take a summary table that you made from Task 1, and write it as a csv
file in your output folder. Use the
here::here() function.
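A minimal sketch of this step, assuming the summary table is a small tibble: toy_summary and the file name toy_summary.csv are hypothetical, and the output folder is created first in case it does not exist yet.

```r
library(readr)
library(here)
library(tibble)

# Hypothetical summary table standing in for one made in Task 1
toy_summary <- tibble(
  height_range_id = c(1, 2),
  mean_diameter   = c(4, 7)
)

# Create the output folder at the project root if it is missing
dir.create(here::here("output"), showWarnings = FALSE)

# Build the path relative to the project root and write the csv
csv_path <- here::here("output", "toy_summary.csv")
write_csv(toy_summary, csv_path)

# Read it back to confirm the file was written as expected
toy_summary_check <- read_csv(csv_path, show_col_types = FALSE)
print(toy_summary_check)
```

Building the path with here::here() keeps it relative to the project root, so the code does not depend on the working directory the document is knitted from.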
Write your model object from Task 3 to an R binary file (an RDS), and
load it again. Be sure to save the binary file in your
output folder. Use the functions saveRDS() and
readRDS().
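The round trip could look like the sketch below. It fits a stand-in model on R's built-in cars dataset so that the example runs on its own; the object names and the file name toy_model.rds are hypothetical.

```r
library(here)

# Stand-in model fitted on the built-in cars dataset
toy_model <- lm(dist ~ speed, data = cars)

# Create the output folder at the project root if it is missing
dir.create(here::here("output"), showWarnings = FALSE)

# Save the model object as an R binary file, then load it again
rds_path <- here::here("output", "toy_model.rds")
saveRDS(toy_model, rds_path)
toy_model_reloaded <- readRDS(rds_path)

# The reloaded object should carry the same coefficients
all.equal(coef(toy_model), coef(toy_model_reloaded))
```

Unlike a csv file, an RDS file stores the R object itself, so the reloaded model can be passed straight to functions such as summary() or predict().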
Here are the criteria we’re looking for.
The document should read sensibly from top to bottom, with no major continuity errors.
The README file should still satisfy the criteria from the last milestone, i.e. it has been updated to match the changes to the repository made in this milestone.
You should have at least three folders in the top level of your repository: one for each milestone, and one output folder. If there are any other folders, these are explained in the main README.
Each milestone document is contained in its respective folder, and nowhere else.
Every level-1 folder (that is, the ones stored in the top level, like
“Milestone1” and “output”) has a README file, explaining in
a sentence or two what is in the folder, in plain language (it’s enough
to say something like “This folder contains the source for Milestone
1”).
All output is recent and relevant:
All Rmd files have been knitted to their output md files.
Our recommendation: delete all output files, and re-knit each milestone’s Rmd file, so that everything is up to date and relevant.
You’ve tagged a release for Milestone 2.
Thanks to Victor Yuan for mostly putting this together.